Introduce the following notation:
Computation at layer $l$ of the network: $\zv_l = \Wv_l \av_{l-1} + \bv_l$, $\av_l = h_l (\zv_l)$
The whole network: $\xv = \av_0 \xrightarrow{\Wv_1,\bv_1} \zv_1 \xrightarrow{h_1} \av_1 \xrightarrow{\Wv_2,\bv_2} \cdots \xrightarrow{\Wv_L,\bv_L} \zv_L \xrightarrow{h_L} \av_L = \hat{\yv}$
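A minimal NumPy sketch of this layer-by-layer forward pass (the layer sizes, random weights, and the tanh/identity activations below are illustrative assumptions, not part of the notation above):

```python
import numpy as np

def forward(x, Ws, bs, hs):
    """Forward pass: a_0 = x, z_l = W_l a_{l-1} + b_l, a_l = h_l(z_l)."""
    a = x
    for W, b, h in zip(Ws, bs, hs):
        z = W @ a + b
        a = h(z)
    return a                                  # a_L = y_hat

# Illustrative 2-3-1 network with a tanh hidden layer and an identity output
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
bs = [np.zeros(3), np.zeros(1)]
hs = [np.tanh, lambda z: z]
print(forward(np.array([1.0, -2.0]), Ws, bs, hs))
```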
The earliest M-P neuron model used the step function $\sgn(\cdot)$ as its activation function
Directions for improvement:
Commonly used activation functions include
Logistic function: squashes $\Rbb$ into $[0,1]$, so the output can be interpreted as a probability:
$$ \begin{align*} \qquad \sigma(z) = \frac{1}{1 + \exp (-z)} = \begin{cases} 1, & z \rightarrow \infty \\ 0, & z \rightarrow -\infty \end{cases} \end{align*} $$
The logistic function is continuously differentiable, and its derivative is largest at zero:
$$ \begin{align*} \qquad \nabla \sigma(z) = \sigma(z) (1 - \sigma(z)) \le \left( \frac{\sigma(z) + 1 - \sigma(z)}{2} \right)^2 = \frac{1}{4} \end{align*} $$
Equality in the AM–GM inequality holds when $\sigma(z) = 1 - \sigma(z)$, i.e., at $z = 0$
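A quick numerical check of this derivative bound (a sketch; the grid of $z$ values is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 2001)                  # grid that includes z = 0
grad = sigmoid(z) * (1.0 - sigmoid(z))          # sigma'(z) = sigma(z)(1 - sigma(z))
print(grad.max(), z[np.argmax(grad)])           # 0.25, attained at z = 0
assert np.all(grad <= 0.25 + 1e-12)
```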
Hyperbolic tangent: squashes $\Rbb$ into $[-1,1]$, so the output is zero-centered; it is a scaled and shifted logistic function:
$$ \begin{align*} \qquad \tanh(z) & = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)} = \frac{1 - \exp(-2z)}{1 + \exp(-2z)} = 2 \sigma(2z) - 1 \\[2pt] & = \begin{cases} 1, & z \rightarrow \infty \\ -1, & z \rightarrow -\infty \end{cases} \\[10pt] \nabla \tanh(z) & = 4 \sigma(2z) (1 - \sigma(2z)) \le 1 \end{align*} $$
The hyperbolic tangent is continuously differentiable, and its derivative is largest at $z = 0$
Zero-centered outputs keep the inputs of all non-input layers near zero, where the hyperbolic tangent's derivative is largest, so gradient-descent updates are more effective; the logistic function's output is always positive, which slows down the convergence of gradient descent
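A small sketch verifying the identity $\tanh(z) = 2\sigma(2z) - 1$ and the derivative bound numerically:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 1001)
assert np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)   # tanh(z) = 2*sigma(2z) - 1

grad = 4.0 * sigmoid(2.0 * z) * (1.0 - sigmoid(2.0 * z))        # tanh'(z), equal to 1 - tanh(z)^2
print(grad.max(), z[np.argmax(grad)])                           # ~1.0, attained at z = 0
```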
Rectified linear unit (ReLU):
$$ \begin{align*} \qquad \relu(z) = \max \{ 0, z \} = \begin{cases} z & z \ge 0 \\ 0 & z < 0 \end{cases} \end{align*} $$
Advantages
Disadvantage: the "dying ReLU" problem
By the chain rule,
$$ \begin{align*} \qquad \nabla_{\wv} \relu(\wv^\top \xv + b) & = \frac{\partial \relu(\wv^\top \xv + b)}{\partial (\wv^\top \xv + b)} \frac{\partial (\wv^\top \xv + b)}{\partial \wv} \\ & = \frac{\partial \max \{ 0, \wv^\top \xv + b \}}{\partial (\wv^\top \xv + b)} \xv \\ & = \Ibb(\wv^\top \xv + b \ge 0) \xv \end{align*} $$
If the $(\wv, b)$ of some neuron in the first hidden layer is initialized badly, so that $\wv^\top \xv + b < 0$ for every $\xv$, then its gradient with respect to $(\wv, b)$ is zero and the neuron will never be updated during the rest of training
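A small sketch illustrating this dead-neuron situation: with a badly chosen $(\wv, b)$ (the data and values below are made up for illustration), the gradient $\Ibb(\wv^\top \xv + b \ge 0)\,\xv$ is zero for every sample:

```python
import numpy as np

def relu_grad_wrt_w(w, b, X):
    """Gradient of relu(w^T x + b) w.r.t. w for each row x of X:
    I(w^T x + b >= 0) * x, as derived above."""
    pre = X @ w + b
    return (pre >= 0).astype(float)[:, None] * X

X = np.random.default_rng(0).standard_normal((100, 5))   # made-up inputs
w, b = np.zeros(5), -10.0                                 # bad init: w^T x + b < 0 for all x
print(np.abs(relu_grad_wrt_w(w, b, X)).max())             # 0.0 -- the neuron never gets updated
```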
Possible fixes: leaky ReLU, parametric ReLU, ELU, softplus
Leaky ReLU: still has a non-zero gradient when $\wv^\top \xv + b < 0$
$$ \begin{align*} \qquad \lrelu(z) & = \begin{cases} z & z \ge 0 \\ \gamma z & z < 0 \end{cases} \\ & = \max \{ 0, z \} + \gamma \min \{ 0, z \} \overset{\gamma < 1}{=} \max \{ z, \gamma z \} \end{align*} $$
where the slope $\gamma$ is a small constant, e.g., $0.01$
Parametric ReLU (PReLU): the slope $\gamma_i$ is learnable
$$ \begin{align*} \qquad \prelu(z) & = \begin{cases} z & z \ge 0 \\ \gamma_i z & z < 0 \end{cases} \\[4pt] & = \max \{ 0, z \} + \gamma_i \min \{ 0, z \} \end{align*} $$
Different neurons may have different parameters, or a group of neurons may share one parameter
Exponential linear unit (ELU):
$$ \begin{align*} \qquad \elu(z) & = \begin{cases} z & z \ge 0 \\ \gamma (\exp(z) - 1) & z < 0 \end{cases} \\[4pt] & = \max \{ 0, z \} + \min \{ 0, \gamma (\exp(z) - 1) \} \end{align*} $$
The softplus function can be viewed as a smooth version of ReLU:
$$ \begin{align*} \qquad \softplus(z) = \ln (1 + \exp(z)) \end{align*} $$
Its derivative is the logistic function:
$$ \begin{align*} \qquad \nabla \softplus(z) = \frac{\exp(z)}{1 + \exp(z)} = \frac{1}{1 + \exp(-z)} \end{align*} $$
The Swish function is a self-gated activation function:
$$ \begin{align*} \qquad \swish(z) = z \cdot \sigma (\beta z) = \frac{z}{1 + \exp(-\beta z)} \end{align*} $$
where $\beta$ is either a learnable parameter or a fixed hyperparameter
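A NumPy sketch of these ReLU variants (PReLU uses the same form as the leaky ReLU, with $\gamma$ treated as a learnable parameter rather than a constant; all constants below are illustrative defaults):

```python
import numpy as np

def leaky_relu(z, gamma=0.01):
    # max{0, z} + gamma * min{0, z}; PReLU uses this form with a learnable gamma
    return np.maximum(0.0, z) + gamma * np.minimum(0.0, z)

def elu(z, gamma=1.0):
    return np.where(z >= 0, z, gamma * (np.exp(z) - 1.0))

def softplus(z):
    return np.log1p(np.exp(z))              # smooth version of ReLU; derivative is the logistic function

def swish(z, beta=1.0):
    return z / (1.0 + np.exp(-beta * z))    # z * sigma(beta * z)

z = np.linspace(-3.0, 3.0, 7)
for f in (leaky_relu, elu, softplus, swish):
    print(f.__name__, f(z))
```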
Consider layer $l$ of the network:
$$ \begin{align*} \qquad \zv_l & = \Wv_l \av_{l-1} + \bv_l \\ \av_l & = h_l (\zv_l) \end{align*} $$
The activation functions above all map $\Rbb \mapsto \Rbb$, i.e., $[\av_l]_i = h_l ([\zv_l]_i), ~ i \in [n_l]$
The maxout unit instead maps $\Rbb^{n_l} \mapsto \Rbb$: its input is the whole vector $\zv_l$, and it is defined as
$$ \begin{align*} \qquad \maxout (\zv) = \max_{k \in [K]} \{ \wv_k^\top \zv + b_k \} \end{align*} $$
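A minimal sketch of a maxout unit; the number of pieces $K$ and the random parameters are illustrative:

```python
import numpy as np

def maxout(z, W, b):
    """Maxout unit: max_k { w_k^T z + b_k }, with the w_k as the rows of W."""
    return np.max(W @ z + b)

rng = np.random.default_rng(0)
K, n_l = 4, 3                                # number of pieces and input dimension (illustrative)
W, b = rng.standard_normal((K, n_l)), rng.standard_normal(K)
print(maxout(rng.standard_normal(n_l), W, b))
```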
The first $L-1$ layers form a composite function $\psi: \Rbb^d \mapsto \Rbb^{n_{L-1}}$, which can be viewed as a feature transformation
The last layer is a learner $\hat{\yv} = g(\psi(\xv); \Wv_L, \bv_L)$ that makes predictions from the transformed input $\psi(\xv)$
Logistic regression can also be viewed as a neural network with a single layer (no hidden layers)
Traditional machine learning: feature engineering and model learning are two separate stages
Deep learning: feature engineering and model learning are merged into one, end-to-end
Recall the whole network: $\xv = \av_0 \xrightarrow{\Wv_1,\bv_1} \zv_1 \xrightarrow{h_1} \av_1 \xrightarrow{\Wv_2,\bv_2} \cdots \xrightarrow{\Wv_L,\bv_L} \zv_L \xrightarrow{h_L} \av_L = \hat{\yv}$
The optimization objective of the neural network is
$$ \begin{align*} \qquad \min_{\Wv, \bv} ~ \frac{1}{m} \sum_{i \in [m]} \ell (\yv_i, \hat{\yv}_i) \end{align*} $$
where computing the loss $\ell (\yv, \hat{\yv})$ is the forward pass
The gradient-descent update rule is
$$ \begin{align*} \qquad \Wv ~ \leftarrow ~ \Wv - \frac{\eta}{m} \sum_{i \in [m]} \class{yellow}{\frac{\partial \ell (\yv_i, \hat{\yv}_i)}{\partial \Wv}}, \quad \bv ~ \leftarrow ~ \bv - \frac{\eta}{m} \sum_{i \in [m]} \class{yellow}{\frac{\partial \ell (\yv_i, \hat{\yv}_i)}{\partial \bv}} \end{align*} $$
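A sketch of this update step, assuming a hypothetical `grad_fn(x, y, Ws, bs)` that returns the per-example gradients $(\partial\ell/\partial\Wv_l, \partial\ell/\partial\bv_l)$ for every layer, stored in the same shapes as $\Wv_l$ and $\bv_l$ (for instance, a wrapper around the backpropagation sketch given later):

```python
def gd_step(Ws, bs, grad_fn, batch, eta):
    """One gradient-descent step: average the per-example gradients over the
    batch, then W_l <- W_l - eta * mean dW_l and b_l <- b_l - eta * mean db_l."""
    m = len(batch)
    sums = None
    for x, y in batch:
        grads = grad_fn(x, y, Ws, bs)            # [(dW_l, db_l) for each layer l]
        if sums is None:
            sums = [(dW.copy(), db.copy()) for dW, db in grads]
        else:
            sums = [(sW + dW, sb + db)
                    for (sW, sb), (dW, db) in zip(sums, grads)]
    for l, (sW, sb) in enumerate(sums):
        Ws[l] -= (eta / m) * sW
        bs[l] -= (eta / m) * sb
    return Ws, bs
```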
Recall the whole network: $\xv = \av_0 \xrightarrow{\Wv_1,\bv_1} \zv_1 \xrightarrow{h_1} \av_1 \xrightarrow{\Wv_2,\bv_2} \cdots \xrightarrow{\Wv_L,\bv_L} \zv_L \xrightarrow{h_L} \av_L = \hat{\yv}$
For the last layer, $\zv_L = \Wv_L ~ \av_{L-1} + \bv_L$, $\av_L = h_L (\zv_L)$; by the chain rule,
$$ \begin{align*} \qquad \frac{\partial \ell (\yv, \hat{\yv})}{\partial \bv_L} & = \frac{\partial \ell (\yv, \hat{\yv})}{\partial \zv_L} \frac{\partial \zv_L}{\partial \bv_L} = \deltav_L^\top \frac{\partial \zv_L}{\partial \bv_L} = \deltav_L^\top \\ \frac{\partial \ell (\yv, \hat{\yv})}{\partial \Wv_L} & = \sum_{j \in [n_L]} \frac{\partial \ell (\yv, \hat{\yv})}{\partial [\zv_L]_j} \frac{\partial [\zv_L]_j}{\partial \Wv_L} = \sum_{j \in [n_L]} [\deltav_L]_j \frac{\partial [\zv_L]_j}{\partial \Wv_L} \end{align*} $$
where $\deltav_L^\top = \partial \ell (\yv, \hat{\yv}) / \partial \zv_L \in \Rbb^{n_L}$ is the error term of the $L$-th layer, which can be computed directly
Similarly, for layer $l$ with $\zv_l = \Wv_l \av_{l-1} + \bv_l$, $\av_l = h_l (\zv_l)$, the chain rule gives
$$ \begin{align*} \qquad \frac{\partial \ell (\yv, \hat{\yv})}{\partial \bv_l} = \deltav_l^\top, \quad \frac{\partial \ell (\yv, \hat{\yv})}{\partial \Wv_l} = \sum_{j \in [n_l]} [\deltav_l]_j \frac{\partial [\zv_l]_j}{\partial \Wv_l} \end{align*} $$
where $\deltav_l^\top = \partial \ell (\yv, \hat{\yv}) / \partial \zv_l \in \Rbb^{n_l}$ is the error term of the $l$-th layer
Backpropagation (BP): the error term of each layer is obtained from the layer after it
$$ \begin{align*} \qquad \deltav_{l-1}^\top = \frac{\partial \ell (\yv, \hat{\yv})}{\partial \zv_{l-1}} = \frac{\partial \ell (\yv, \hat{\yv})}{\partial \zv_l} \frac{\partial \zv_l}{\partial \av_{l-1}} \frac{\partial \av_{l-1}}{\partial \zv_{l-1}} = \deltav_l^\top \Wv_l \frac{\partial h_{l-1}(\zv_{l-1})}{\partial \zv_{l-1}} \end{align*} $$
Finally, for layer $l$ with $\zv_l = \Wv_l \av_{l-1} + \bv_l$, how do we compute $\partial [\zv_l]_j / \partial \Wv_l$?
Note that $z_j = \sum_k w_{jk} a_k + b_j$ depends only on the $j$-th row of $\Wv$, so
$$ \begin{align*} \qquad & \frac{\partial z_j}{\partial \Wv} = \underbrace{\begin{bmatrix} \zerov, \ldots, \av, \ldots, \zerov \end{bmatrix}}_{\text{only }\av\text{ at }j\text{-th column}} = \av \ev_j^\top \\[4pt] \qquad & \Longrightarrow \frac{\partial \ell (\yv, \hat{\yv})}{\partial \Wv_l} = \sum_{j \in [n_l]} [\deltav_l]_j \frac{\partial [\zv_l]_j}{\partial \Wv_l} = \av_{l-1} \sum_{j \in [n_l]} [\deltav_l]_j \ev_j^\top = \av_{l-1} \deltav_l^\top \end{align*} $$
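A per-example backpropagation sketch following the derivation above, under the illustrative assumptions of a squared loss $\ell(\yv,\hat{\yv}) = \tfrac12\Vert\hat{\yv}-\yv\Vert^2$ and an identity output activation, so that $\deltav_L = \av_L - \yv$; for convenience the gradient for $\Wv_l$ is stored in $\Wv_l$'s own shape, i.e., as $(\av_{l-1}\deltav_l^\top)^\top = \deltav_l\av_{l-1}^\top$:

```python
import numpy as np

def backprop(x, y, Ws, bs, hs, dhs):
    """Per-example gradients via the recursion above:
    delta_{l-1} = (W_l^T delta_l) * h'_{l-1}(z_{l-1}),
    dW_l = delta_l a_{l-1}^T (W_l's shape), db_l = delta_l."""
    # Forward pass, caching all z_l and a_l
    a, As, Zs = x, [x], []
    for W, b, h in zip(Ws, bs, hs):
        z = W @ a + b
        a = h(z)
        Zs.append(z)
        As.append(a)
    # Assumed squared loss with identity output: delta_L = a_L - y
    delta = As[-1] - y
    grads = [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        dW = np.outer(delta, As[l])              # delta_l a_{l-1}^T
        db = delta.copy()                        # delta_l
        grads[l] = (dW, db)
        if l > 0:                                # propagate the error term backwards
            delta = (Ws[l].T @ delta) * dhs[l - 1](Zs[l - 1])
    return grads

# Illustrative 2-3-1 network: tanh hidden layer, identity output
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
bs = [np.zeros(3), np.zeros(1)]
hs = [np.tanh, lambda z: z]
dhs = [lambda z: 1.0 - np.tanh(z) ** 2]          # derivative of the hidden activation
grads = backprop(np.array([1.0, -2.0]), np.array([0.5]), Ws, bs, hs, dhs)
print([g[0].shape for g in grads])               # [(3, 2), (1, 3)]
```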
Input: training set, validation set, and the relevant hyperparameters
Output: $\Wv$ and $\bv$
```python
import numpy as np
from sklearn.datasets import load_iris        # small placeholder dataset, just for illustration
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)             # training data (placeholder)
h = 16                                        # number of hidden neurons (illustrative)

mlp = MLPClassifier(
    hidden_layer_sizes=(h,),       # number of neurons in each hidden layer
    activation='logistic',         # identity, logistic, tanh, relu
    max_iter=100,                  # maximum number of iterations
    solver='lbfgs',                # solver: lbfgs, sgd, adam
    alpha=0,                       # regularization coefficient
    batch_size=32,                 # mini-batch size
    learning_rate='constant',      # constant, invscaling, adaptive
    shuffle=True,                  # whether to reshuffle samples each epoch
    momentum=0.9,                  # momentum coefficient, sgd only
    nesterovs_momentum=True,       # use Nesterov acceleration with momentum
    early_stopping=False,          # whether to stop early
    warm_start=False,              # whether to warm-start from the previous fit
    random_state=1,
    verbose=False,
    # ... other parameters left at their defaults
)
clf = mlp.fit(X, y)
acc = clf.score(X, y)
```